Introduction to R

Brief introduction for STATA users

Data Analytics Unit

Intro

What is R?

  • R is a high-level programming language for statistical computing and data visualization.
  • It is a dialect of the S programming language.
  • It is the second most popular programming language for data science just behind Python.
  • Paradigm:
    • Functional Programming
    • Object-oriented
  • Latest version: R 4.3.0 launched in April, 2023

Popularity

  • According to Google Trends, R is around 10 times more popular than Stata in Google Searches:

RStudio I

  • R is a language NOT a program.
  • You can interact directly with your computer by opening the terminal and write there.
  • People usually use an Integrated Development Environment (IDE) to interact with their machines using R.
  • The most commonly IDE when working with R is RStudio
  • Latest version for Mac: 2023.03.1

RStudio II

R Scripts

  • Scripts are the most important part of RStudio.
  • Equivalent to do-files in Stata.
  • Cornerstone of reproducibility.

R vs Stata

Data and Objects

Stata

  • You can have only one data set open at a time
  • You can save matrices, globals and locals as secondary objects

R

  • You can load multiple data sets as your memory allows it.
  • Everything you load or save is saved as an object in the Global Environment
  • You haave a wide range of data types at your disposal: data frames, functions, vectors, values, lists, tupples, matrices, etc.

Global Environment

Base Functionalities

Stata

  • Stata is mainly developed by STATA Corp
  • Most of the commands you use on a daily basis are out-of-the-box (2GB)
  • Commands written by the community are secondary
  • You install commands using ssc install
  • Commands are always ready to be used

R

  • Most popular packages are maintained (not developed) by Posit
  • R comes with a very limited set of functions in the R base version (90MB)
  • You depend a lot of the community and their developments
  • You install packages using install.packages()
  • Packages come with a wide range of functions

Libraries

  • You always need to load packages outside the R base version:
library(tidyverse)
library(sf)
  • We use the pacman library to load our packages:
# Required packages
library(pacman)

p_load(char = c(
  # Visualizations
  "showtext", "ggtext", "ggsankey", "patchwork",
  
  # Data Loading
  "haven", "readxl",
  
  # GIS
  "tmaptools", "rmapshaper", "sf",
  
  # Good 'ol Tidyverse
  "tidyverse"
  
))

The Tidyverse

  • The tidyverse is a collection of packages designed for data science.
  • These packages are actively developed by the community and maintained by Posit.
  • They share an underlying design philosophy, grammar, and data structures.
  • Due to how they complement each other, it is very difficult to distinguish between individual packages.
  • They have come to replace many of the functionalities from the base version of R.
  • Due to their importance, they are always loaded in every script.
  • More information on tidyverse.org

Base R vs Tidyverse

Base R:

# Loading CSV data
dataFrame <- read.csv("data.csv")

# Creating a new variable
dataFrame$age_square <- age^2

# Subsetting the data frame
dataFrame <- dataFrame[dataFrame$age_square >= 250, ]

# Merging
merge(dataFrame, aux_data, by = "id")

Tidyverse:

# Loading CSV data
dataFrame <- read_csv("data.csv")

# Creating a variable, subsetting and merging:
dataFrame <- dataFrame %>%
  mutate(age_square = age^2) %>%
  filter(age_square >= 250) %>%
  left_join(aux_data, by = "id")

Stata vs Tidyverse

Stata:

# Loading CSV data
import delimited "data.csv", clear

# Creating a new variable
gen age^2

# Subsetting the data frame
keep if age_square >= 250

# Merging
merge 1:1 id using "aux_data"

# Collapsing data
collapse (mean) age_square, by(country)

Tidyverse:

# Loading CSV data
dataFrame <- read_csv("data.csv")

# Manipulating data 
dataFrame_new <- dataFrame %>%     # Saving as a different object
  mutate(age_square = age^2) %>%
  filter(age_square >= 250) %>%
  left_join(aux_data, by = "id") %>%
  group_by(country) %>%
  summarise(age_square = mean(age_square))

Functional Programming

Stata

  • Stata is designed around commands in which you define options and arguments.
  • Thanks to the ado and mata languages, you can write your own commands.
  • All commands must be named using program define
program define my_function
  display "Hello, world"
end

R

  • R is designed around functions where you only define arguemts.
  • R allows you to write named and lambda functions.
  • Lambda functions are functions that you write on the go and directly apply to the data without having to save it or write a script just for the function.
  • You create functions using function()
function(){
  print("Hello, world")
}

Overall Comparison

Stata

  • Syntax has a low learning curve
  • Highly specialized, straightforward commands
  • Slow development of new features
  • Limited harmonization with other languages and tools: python, markdown, git, html

R

  • Syntax has a very steep learning curve
  • Less specialized, but more flexible options
  • Faster development of new packages
  • Advanced capabilities and harmonization with other languages and tools: python, markdown, git, html

Workflow

Git and GitHub

  • Git is a free and open source software that allows users to set up a version control system.
  • Given the nature of its features, it is normally used for collaboratively developing code and data integrity.
  • GitHub is a website and cloud-based service that helps developers store and manage their code, as well as track and control changes to their code through git.
  • The workflow in the Data Analytics Unit starts and finishes with Git.

GitHub

Workflow Definition and Features

  • A workflow is a system for managing repetitive processes and tasks which occur in a particular order.
    • Output recognition
    • Process understanding
    • Tasks that compose the process
    • Tasks simplification
  • Why we need so much structure? Human Rights Data Analysis Group
  • We can learn from the past (record and improve our process)
  • We can read each other work
  • We can test whether what we’ve done is correct

Our Workflow

Explaining the Cluster Analysis

  • We can learn from the past (record and improve our process)
  • We can read each other work
  • We can test whether what we’ve done is correct

Our Workflow

Cluster Analysis

Explaining the Cluster Analysis

  • First we load all the libraries we need, and also we load the data base that contains all the historical scores.
# Loading libraries
library(pacman)
p_load(readxl, 
       tidyverse, 
       factoextra, 
       cluster, 
       NbClust, 
       FactoMineR, 
       dendextend)

# Loading data base with all the scores

master_data.df <- read_csv("Data/master_data.csv") %>%
  mutate(region4labels = str_replace_all(region, "&", "&\n"))

Explaining the Cluster Analysis

  • We prepare the data. We need a data base with the rate of the scores between 2015 and 2022 per country.
factors_data <- master_data.df %>%
    select(year, country, starts_with("fct"), roli_index) %>% 
  filter(year == 2015 | year == 2022) %>% # This is the period that we are studying
  dplyr::rename(factor_1 = fct1,
                factor_2 = fct2,
                factor_3 = fct3,
                factor_4 = fct4,
                factor_5 = fct5,
                factor_6 = fct6,
                factor_7 = fct7,
                factor_8 = fct8) %>% # We reshape the data base to make easier the estimations of the rates
  pivot_wider(id_cols = country, names_from = year, 
              values_from = c(factor_1, factor_2, factor_3, factor_4, factor_5, 
                              factor_6, factor_7, factor_8, roli_index,
                              )) %>% # We estimate the rate changes between 2015 and 2022 per factor
  mutate(differences_factor_1 = ((factor_1_2022 - factor_1_2015)/factor_1_2015), 
         differences_factor_2 = ((factor_2_2022 - factor_2_2015)/factor_2_2015),
         differences_factor_3 = ((factor_3_2022 - factor_3_2015)/factor_3_2015),
         differences_factor_4 = ((factor_4_2022 - factor_4_2015)/factor_4_2015),
         differences_factor_5 = ((factor_5_2022 - factor_5_2015)/factor_5_2015),
         differences_factor_6 = ((factor_6_2022 - factor_6_2015)/factor_6_2015),
         differences_factor_7 = ((factor_7_2022 - factor_7_2015)/factor_7_2015),
         differences_factor_8 = ((factor_8_2022 - factor_8_2015)/factor_8_2015),
         differences_roli = (roli_index_2022 - roli_index_2015)/roli_index_2015) %>%
  select(country, starts_with("differences"))

Explaining the Cluster Analysis

We do some tests to define the optimal number of clusters. This is based on the independent information provided by each one.

fviz_nbclust(cluster_data, kmeans, method = 'wss') # According to the graph the main change 
                                                   # is produced after the third cluster
fviz_nbclust(cluster_data, kmeans, method = 'silhouette') # According to this method the optimal 
                                                          # number of cluster is two

Explaining the Cluster Analysis

We constructed each cluster using the Hierarchical Clustering Method and determined the distribution pattern of each country.

res2 <- hcut(cluster_data, k = 3, stand = T, hc_metric = 'euclidean') # We use the package FactoMineR

Main results of the Cluster Analysis

Main results of the Cluster Analysis

Resources for learning and troubleshooting

Resources for learning

Resources for troubleshooting